Abstract: Additive genetic variance in natural populations is commonly estimated using mixed
models, in which the covariance of the genetic effects is modeled by a genetic similarity matrix derived 
from a dense set of markers. An important but usually implicit assumption is that the presence of any 
non-additive genetic effect only increases the residual variance, and does not affect estimates of additive 
genetic variance. Here we show that this is only true for panels of unrelated individuals. When individuals are
genetically related, the combination of population structure and epistatic interactions can lead to inflated
estimates of additive genetic variance.


Mixed models with random genetic effects have become an important tool for studying the genetic
architecture of complex traits. The covariance of the genetic effects is assumed to be proportional to a
genetic similarity matrix (GSM) based on a dense set of markers, which is equivalent to assuming additive
effects for each standardized marker score. Under several additional assumptions, such as constant
LD, this gives unbiased estimates of additive genetic variance and narrow-sense heritability ([…]).
The sampling variance of such heritability estimators has been studied in […]. These results
are however derived under the assumption that the model is correct, i.e. contains the true distribution
of the data. Here we consider situations where this is not the case, and argue that potential sources of
bias may be identified by computing the parameter value $\bar\theta$ which minimizes the Kullback-Leibler
divergence $KL(Q, P_\theta) = \int \log(Q/P_\theta)\,dQ$ with respect to the true distribution $Q$. It is a well-known fact
from statistics that in case of misspecification, i.e. when $Q$ is not contained in the model $\{P_\theta : \theta \in \Theta\}$,
the maximum likelihood (ML) estimator converges to $\bar\theta(\Theta, Q)$. Several authors have studied missing
or phantom heritability resulting from undetected epistatic interactions between specific loci ([…]).
Here we investigate misspecification in a mixed model context, the covariance of the data being
misspecified due to infinitesimal interactions or other non-additive effects. We consider three different
scenarios (A-C), in each of which the additive and non-additive genetic variances are
0.4 and 0.2, respectively. The total phenotypic variance is assumed to be known and equal to 1, giving a narrow- and
broad-sense heritability of 0.4 and 0.6, respectively.
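
For zero-mean multivariate normal distributions, as used in all three scenarios, the KL-divergence has a closed form. Writing $\Sigma_Q$ and $\Sigma_\theta$ for the $n \times n$ covariance matrices of $Q$ and $P_\theta$ (notation introduced here for illustration, not taken from the original text), a standard result gives:

```latex
KL(Q, P_\theta) = \frac{1}{2} \left[ \operatorname{tr}\left( \Sigma_\theta^{-1} \Sigma_Q \right) - n
                  + \log\det \Sigma_\theta - \log\det \Sigma_Q \right]
```

Only the trace term and $\log\det \Sigma_\theta$ depend on $\theta$, so locating the minimizer on a grid of variance components requires no more than a linear solve and a log-determinant per grid point.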

Scenario A: the phenotype $Y = (Y_1, \ldots, Y_n)'$ of $n$ individuals is modeled using the multivariate
normal distribution

$P_{\sigma_A^2, \sigma_E^2} = N(0, \sigma_A^2 K + \sigma_E^2 I_n)$, (1)

where $K$ is a marker-based GSM, $I_n$ the identity matrix, $\sigma_A^2 \in [0,1]$ is the additive genetic variance
and $\sigma_E^2 = 1 - \sigma_A^2$ is the residual variance. We assume however that $Q$, the actual distribution of $Y$, is
the zero-mean normal distribution with covariance $0.4K + 0.2(K \circ K) + 0.4 I_n$, $\circ$ being the Hadamard
(entry-wise) product. The 'epistatic' matrix $K \circ K$ is the covariance due to small epistatic interactions
between all standardized marker scores (File S1). Since $K \circ K$ does not equal the identity matrix $I_n$,
$Q$ is not contained in model (1). Hence, the ML estimator will not recover $Q$, but will instead converge to the point
$(\bar\sigma_A^2, \bar\sigma_E^2)$ minimizing the KL-divergence $KL(Q, P_{\sigma_A^2, \sigma_E^2})$. For genetic similarity matrices derived from
published data in maize, rice and Arabidopsis, $\bar\sigma_A^2$ ranges between 0.47 and 0.53 (Table 1). Hence, the
presence of epistatic interactions leads to inflated estimates of additive genetic variance. For a panel of
simulated unrelated individuals, $\bar\sigma_A^2$ is 0.40, which is due to the much smaller off-diagonal elements of
$K$, making $K \circ K$ almost indistinguishable from $I_n$.
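
The mechanism of scenario A can be explored in a small numerical sketch. The code below is an illustration under stated assumptions, not the computation behind Table 1: markers for inbred lines are simulated from a Balding-Nichols-type model, marker scores are standardized with the known ancestral allele frequencies (real analyses typically use sample frequencies), and the helper names `simulate_gsm` and `kl_minimizer_scenario_a` are hypothetical. For each grid value of the additive variance it evaluates the part of the KL-divergence that depends on that value, using the eigendecomposition of the GSM.

```python
import numpy as np

def simulate_gsm(n, n_markers, n_groups, fst, rng):
    """Toy marker-based GSM for n inbred lines split over n_groups subpopulations."""
    p = rng.uniform(0.2, 0.8, size=n_markers)           # ancestral allele frequencies
    if fst > 0:
        scale = (1.0 - fst) / fst
        # Subpopulation allele frequencies scattered around p (Balding-Nichols-type).
        p_group = rng.beta(p * scale, (1.0 - p) * scale, size=(n_groups, n_markers))
    else:
        p_group = np.tile(p, (n_groups, 1))              # no structure: all groups share p
    group = np.arange(n) % n_groups                      # balanced group labels
    x = (rng.random((n, n_markers)) < p_group[group]).astype(float)
    x_std = (x - p) / np.sqrt(p * (1.0 - p))             # assumption: ancestral standardization
    return x_std @ x_std.T / n_markers

def kl_minimizer_scenario_a(K, grid):
    """Grid value of sigma_A^2 minimizing KL(Q, P) for model (1), sigma_E^2 = 1 - sigma_A^2."""
    n = K.shape[0]
    sigma_q = 0.4 * K + 0.2 * K * K + 0.4 * np.eye(n)    # K * K is the Hadamard product
    lam, U = np.linalg.eigh(K)
    d = np.einsum("ij,ij->j", U, sigma_q @ U)            # diagonal of U' Sigma_Q U
    # KL up to terms that do not depend on a: 0.5 * (trace term + log det of model covariance).
    kl = np.array([0.5 * np.sum(d / (a * lam + 1.0 - a) + np.log(a * lam + 1.0 - a))
                   for a in grid])
    return float(grid[np.argmin(kl)])

rng = np.random.default_rng(1)
grid = np.linspace(0.0, 1.0, 101)
K_structured = simulate_gsm(n=200, n_markers=2000, n_groups=4, fst=0.5, rng=rng)
K_unrelated = simulate_gsm(n=200, n_markers=2000, n_groups=1, fst=0.0, rng=rng)
a_structured = kl_minimizer_scenario_a(K_structured, grid)
a_unrelated = kl_minimizer_scenario_a(K_unrelated, grid)
print("structured panel:", a_structured, " unrelated panel:", a_unrelated)
```

Under these assumptions the expected kinship is `fst` within groups and zero between groups, so the Hadamard square of the GSM is approximately `fst` times the GSM plus `1 - fst` times the identity. Roughly a fraction `fst` of the epistatic variance of 0.2 is then expected to be absorbed by the additive component, whereas for the unstructured panel the minimizer should stay close to 0.4.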

Scenario B: a plant trait is phenotyped on $r$ genetically identical replicates. Following […], the
observations $Y = (Y_{11}, \ldots, Y_{nr})'$ are modeled by the normal distribution

$P_{\sigma_A^2, \sigma_E^2} = N(0, \sigma_A^2 Z K Z' + \sigma_E^2 I_{nr})$, (2)

$Z$ being an incidence matrix assigning plants to genotypes. The true distribution $Q$ is multivariate normal
with covariance $0.4 Z K Z' + 0.2 Z Z' + 0.4 I_{nr}$, i.e. there are non-additive (not necessarily epistatic)
effects with independent $N(0, 0.2)$ distributions. Such effects could be due to, for example, genotype-environment
interaction. In contrast to model (1) (where $Z = I_n$ and $r = 1$), $ZZ'$ is different from $I_{nr}$,
and $Q$ is not contained in model (2). Again, the value $\bar\sigma_A^2$ minimizing KL-divergence is substantially larger
than 0.4 (Table 1), and additive genetic variance will tend to be overestimated. Intuitively, this is because
the block structure $ZZ'$ is better captured by $ZKZ'$ than by the diagonal residual.
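
The intuition for scenario B can be checked directly with the closed-form KL-divergence between zero-mean normal distributions. The sketch below is a toy setting rather than the analysis of the original panels: the GSM is built from independent Gaussian stand-in marker scores (an assumption, so the genotypes are essentially unrelated), there are r = 2 replicates, the function name `gaussian_kl` is hypothetical, and the grid stops at 0.99 so that the residual variance stays positive.

```python
import numpy as np

def gaussian_kl(sigma_q, sigma_p):
    """KL(Q, P) for zero-mean multivariate normals with covariances sigma_q and sigma_p."""
    n = sigma_q.shape[0]
    logdet_p = np.linalg.slogdet(sigma_p)[1]
    logdet_q = np.linalg.slogdet(sigma_q)[1]
    trace_term = np.trace(np.linalg.solve(sigma_p, sigma_q))
    return 0.5 * (trace_term - n + logdet_p - logdet_q)

rng = np.random.default_rng(2)
n, r, n_markers = 60, 2, 600
scores = rng.standard_normal((n, n_markers))       # stand-in standardized marker scores (assumption)
K = scores @ scores.T / n_markers                  # GSM of n nearly unrelated genotypes
Z = np.kron(np.eye(n), np.ones((r, 1)))            # (n*r) x n incidence matrix, r plants per genotype
ZKZ = Z @ K @ Z.T
ZZ = Z @ Z.T
I_nr = np.eye(n * r)

sigma_q = 0.4 * ZKZ + 0.2 * ZZ + 0.4 * I_nr        # true covariance of scenario B
grid = np.linspace(0.0, 0.99, 100)                 # sigma_E^2 = 1 - a must stay positive
kl = np.array([gaussian_kl(sigma_q, a * ZKZ + (1.0 - a) * I_nr) for a in grid])
a_bar = float(grid[np.argmin(kl)])
print("sigma_A^2 minimizing KL under model (2):", a_bar)
```

Because the GSM is close to the identity matrix in this toy setting, $ZKZ'$ is close to $ZZ'$, and the additive component can be expected to absorb most of the non-additive variance of 0.2. With a structured GSM the amount absorbed will depend on the eigenvalues of $K$.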

Scenario C is a combination of A and B. To avoid the misspecification occurring in scenario B, the
model

$P_{\sigma_A^2, \sigma_G^2, \sigma_E^2} = N(0, \sigma_A^2 Z K Z' + \sigma_G^2 Z Z' + \sigma_E^2 I_{nr})$ (3)

is considered, extending (2) with independent non-additive effects. This model has been used in the
analysis of field trials ([…]), as well as genomic prediction ([…]). If in fact the non-additive
effects have covariance $K \circ K$ (as in scenario A), the data have covariance
$0.4 Z K Z' + 0.2 Z (K \circ K) Z' + 0.4 I_{nr}$. As in scenarios A and B, the $\bar\sigma_A^2$ minimizing KL-divergence is larger than 0.4 (Table 1),
while $\bar\sigma_E^2$ was always 0.40.
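
A grid search over the three variance components of model (3), constrained to sum to one, can be sketched as follows. This is a toy illustration under stated assumptions, not the original computation: the GSM comes from Gaussian stand-in marker scores with a shared group component (expected kinship `f` within groups and zero between groups), the design is balanced with r = 2 replicates, and the grid is restricted to positive residual variance. The sketch also uses a computational shortcut that is not described in the text: with balanced replicates, every covariance of the form $Z M Z' + c I_{nr}$ acts as $rM + cI_n$ on the genotype means and as $c$ on the remaining $n(r-1)$ within-genotype contrasts, so the KL-divergence splits into these two parts.

```python
import numpy as np

rng = np.random.default_rng(3)
n, r, n_markers, n_groups, f = 200, 2, 2000, 4, 0.5
group = np.arange(n) % n_groups
# Stand-in marker scores (assumption): shared group component plus individual noise.
scores = (np.sqrt(f) * rng.standard_normal((n_groups, n_markers))[group]
          + np.sqrt(1.0 - f) * rng.standard_normal((n, n_markers)))
K = scores @ scores.T / n_markers

lam, U = np.linalg.eigh(K)
# Genotype-mean block of the true covariance: r * (0.4 K + 0.2 * Hadamard square of K) + 0.4 I_n.
mean_block_q = r * (0.4 * K + 0.2 * K * K) + 0.4 * np.eye(n)
d = np.einsum("ij,ij->j", U, mean_block_q @ U)     # diagonal of U' (mean block of Q) U

steps = 100                                        # grid 0, 0.01, ..., 1 for each component
best, best_kl = None, np.inf
for ia in range(steps + 1):
    for ig in range(steps + 1 - ia):
        ie = steps - ia - ig
        if ie == 0:
            continue                               # residual variance must be positive
        a, g, e = ia / steps, ig / steps, ie / steps
        m = a * r * lam + r * g + e                # eigenvalues of the mean block of model (3)
        # KL up to constants: genotype-mean part plus n*(r-1) within-genotype contrasts.
        kl = 0.5 * (np.sum(d / m + np.log(m)) + n * (r - 1) * (0.4 / e + np.log(e)))
        if kl < best_kl:
            best, best_kl = (a, g, e), kl

a_bar, g_bar, e_bar = best
print("sigma_A^2, sigma_G^2, sigma_E^2 minimizing KL:", a_bar, g_bar, e_bar)
```

In this setting one would expect the minimizing additive variance to exceed 0.4 while the residual variance stays near 0.4, since the replicates identify the residual component and only the split between the two genetic components is affected by the epistatic term.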


Population / source      species       size (n)    A       B       C
Swedish regmap           A. thaliana   298         0.53    0.58    0.53
Hapmap                   A. thaliana   350         0.47    0.60    0.48
Van Heerwaarden et al.   Z. mays       400         0.50    0.58    0.50
Zhao et al.              O. sativa     413         0.51    0.52    0.50
Unrelated individuals    simulated     3000        0.40

Table 1: Values of the additive genetic variance ($\bar\sigma_A^2$) minimizing the Kullback-Leibler divergence
$KL(Q, P)$ with respect to the true distribution ($Q$) of scenarios A-C, with $P$ contained in models
(1)-(3). Minimization was performed by evaluating KL-divergence on the grid 0, 0.01, ..., 1 for all variance
components, under the constraint that they sum to one. Five populations were considered: the Arabidopsis
Hapmap and Swedish regmap […], the rice population of Zhao et al., the maize population of Van Heerwaarden et al. and
a simulated population (File S1). Except for the latter, there are r = 2 replicates of each genotype.

In addition to the analysis of KL-divergence we analyzed simulated traits for the first 4 populations,
for which we found similar or even larger bias (File S2). This has important implications, in particular for
immortal populations, for which genetically identical replicates are available (e.g. A. thaliana, agronomic
crops, bacteria and fungi). Typically there is strong population structure and often only several hundred
different genotypes are phenotyped. One can analyze such data at the individual level (model (2)) or at
the level of genotypic means (model (1), with $\sigma_E^2$ divided by the number of replicates). Previous work […] showed that
in the latter type of analysis, standard errors of heritability estimates can be huge, and recommended
model (2) for both heritability estimation and genomic prediction. Here we have shown that in the presence
of non-additive effects, this model is likely to overestimate additive genetic variance. If however the
non-additive effects are due to epistatic interactions, analysis at the level of genotypic means (model (1)) will
(apart from the large sampling variance) also give inflated estimates of additive genetic variance. This
is a rather realistic scenario, since epistasis may be an important part of the genetic architecture ([…]),
and several other types of non-additive effects can be ruled out or minimized for immortal populations:
e.g. genotype by environment interactions are unlikely in homogeneous controlled environments with
adequate randomization, and dominance effects are impossible when using inbred lines.

Interestingly, the inflation of additive genetic variance is not due to any non-linearity or absence of
main effects, but rather to the population structure present in the epistatic GSM, which to some extent
resembles the structure of the GSM for the additive effects. At the same time, it is this structure that makes
the epistatic GSM distinguishable from the diagonal error. This suggests that epistatic interactions are
easier to model in structured populations, i.e. the sampling variance of epistatic variance components may
not be as large as in unstructured human populations ([…]). Expressions for the asymptotic variance
in a model with both additive and epistatic effects (File S3) indicate that this is indeed the case. More
generally, the inflation of heritability estimates due to misspecification illustrates the difficulty of
modeling and estimating genetic effects. As recently pointed out by […], this is already challenging for the
additive genetic effects, in the sense that different GSMs may be appropriate depending on the genetic
architecture. Indeed, the potential bias resulting from an inappropriate GSM could be assessed by
evaluating KL-divergence with respect to the true model; the same approach applies to alternatives to the epistatic GSM
considered here.